Boosting Variational Inference: an Optimization Perspective
Variational inference is a popular technique to approximate a possibly
intractable Bayesian posterior with a more tractable one. Recently, boosting
variational inference has been proposed as a new paradigm to approximate the
posterior by a mixture of densities by greedily adding components to the
mixture. However, as is the case with many other variational inference
algorithms, its theoretical properties have not been studied. In the present
work, we study the convergence properties of this approach from a modern
optimization viewpoint by establishing connections to the classic Frank-Wolfe
algorithm. Our analysis yields novel theoretical insights regarding the
sufficient conditions for convergence, explicit rates, and algorithmic
simplifications. Since much of the focus in previous work on variational
inference has been on tractability, our work is especially important as a
much-needed attempt to bridge the gap between probabilistic models and their
corresponding theoretical properties.
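To make the Frank-Wolfe connection concrete, here is a minimal illustrative sketch (ours, not the paper's algorithm): each boosting round linearizes the KL objective, greedily selects the Gaussian atom most aligned with the negative gradient (by a crude grid search here), and blends it into the mixture with the classic step size gamma_t = 2/(t+2).

```python
# Illustrative sketch of boosting variational inference as Frank-Wolfe
# over the convex hull of Gaussian atoms (1-D, on a grid).
import numpy as np
from scipy.stats import norm

xs = np.linspace(-10, 10, 2001)                 # evaluation grid
target = 0.5 * norm.pdf(xs, -2, 1) + 0.5 * norm.pdf(xs, 3, 0.7)  # target p(x)

q = norm.pdf(xs, 0, 3)                          # initial approximation q_0
for t in range(10):
    # Linearized KL(q || p): the gradient is log(q / p) + 1; the constant
    # does not affect atom selection since every atom integrates to 1.
    grad = np.log(q + 1e-12) - np.log(target + 1e-12)
    # Greedy step: pick the Gaussian atom most negatively aligned with
    # the gradient, by grid search over (mu, sigma).
    best_atom, best_val = None, np.inf
    for mu in np.linspace(-6, 6, 25):
        for sigma in (0.5, 1.0, 2.0):
            atom = norm.pdf(xs, mu, sigma)
            val = np.trapz(atom * grad, xs)     # <atom, grad>
            if val < best_val:
                best_atom, best_val = atom, val
    gamma = 2.0 / (t + 2)                       # classic Frank-Wolfe step size
    q = (1 - gamma) * q + gamma * best_atom     # convex mixture update
```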
Greedy Algorithms for Cone Constrained Optimization with Convergence Guarantees
Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolfe
(FW) algorithms regained popularity in recent years due to their simplicity,
effectiveness and theoretical guarantees. MP and FW address optimization over
the linear span and the convex hull of a set of atoms, respectively. In this
paper, we consider the intermediate case of optimization over the convex cone,
parametrized as the conic hull of a generic atom set, leading to the first
principled definitions of non-negative MP algorithms for which we give explicit
convergence rates and demonstrate excellent empirical performance. In
particular, we derive sublinear ($\mathcal{O}(1/t)$) convergence on general
smooth and convex objectives, and linear convergence ($\mathcal{O}(e^{-t})$) on
strongly convex objectives, in both cases for general sets of atoms.
Furthermore, we establish a clear correspondence of our algorithms to known
algorithms from the MP and FW literature. Our novel algorithms and analyses
target general atom sets and general objective functions, and hence are
directly applicable to a large variety of learning settings.
Comment: NIPS 2017
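As an illustration of the non-negative matching pursuit idea over a conic hull (a sketch under simplifying assumptions, not one of the paper's exact variants): for a least-squares objective, each step picks the atom most positively correlated with the residual and takes a step with a non-negative coefficient, so the iterate never leaves the cone.

```python
# Sketch of a non-negative matching pursuit for min_x 0.5 * ||y - x||^2
# with x constrained to the conic hull of a fixed atom set.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))       # atoms as columns
A /= np.linalg.norm(A, axis=0)          # unit-norm atoms
y = rng.standard_normal(50)             # target signal

x = np.zeros(50)                        # start at the apex of the cone
for _ in range(100):
    r = y - x                           # residual = negative gradient
    scores = A.T @ r
    i = int(np.argmax(scores))          # most positively correlated atom
    if scores[i] <= 1e-10:              # no atom improves the objective: stop
        break
    x = x + scores[i] * A[:, i]         # step with a non-negative coefficient
```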
SOM-VAE: Interpretable Discrete Representation Learning on Time Series
High-dimensional time series are common in many domains. Since human
cognition is not optimized to work well in high-dimensional spaces, these areas
could benefit from interpretable low-dimensional representations. However, most
representation learning algorithms for time series data are difficult to
interpret. This is due to non-intuitive mappings from data features to salient
properties of the representation and non-smoothness over time. To address this
problem, we propose a new representation learning framework building on ideas
from interpretable discrete dimensionality reduction and deep generative
modeling. This framework allows us to learn discrete representations of time
series, which give rise to smooth and interpretable embeddings with superior
clustering performance. We introduce a new way to overcome the
non-differentiability in discrete representation learning and present a
gradient-based version of the traditional self-organizing map algorithm that is
more performant than the original. Furthermore, to allow for a probabilistic
interpretation of our method, we integrate a Markov model in the representation
space. This model uncovers the temporal transition structure, improves
clustering performance even further and provides additional explanatory
insights as well as a natural representation of uncertainty. We evaluate our
model in terms of clustering performance and interpretability on static
(Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST
images, a chaotic Lorenz attractor system with two macro states, as well as on
a challenging real-world medical time series application on the eICU data set.
Our learned representations compare favorably with competitor methods and
facilitate downstream tasks on the real-world data.
Comment: Accepted for publication at the Seventh International Conference on Learning Representations (ICLR 2019)
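A hedged sketch of the gradient-based SOM ingredient described above, with illustrative losses and hyperparameters rather than the paper's exact setup: the encoding is quantized to its nearest codebook vector, gradients are copied straight through the non-differentiable selection, and the winner's neighbors on the SOM grid are also pulled toward the encoding.

```python
# Sketch of a differentiable SOM-style quantization layer (illustrative).
import torch

H = W = 8                                   # 8x8 SOM grid
D = 16                                      # latent dimension
codebook = torch.randn(H * W, D, requires_grad=True)

def som_quantize(z_e):
    """z_e: (batch, D) encodings from the encoder."""
    dist = torch.cdist(z_e, codebook)       # (batch, H*W) pairwise distances
    k = dist.argmin(dim=1)                  # winner unit per sample
    z_q = codebook[k]
    # Straight-through: forward pass uses z_q, gradients flow to z_e.
    z_st = z_e + (z_q - z_e).detach()
    # SOM-style loss: pull the winner and its grid neighbors toward z_e.
    rows, cols = k // W, k % W
    som_loss = torch.tensor(0.0)
    for dr, dc in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        r = (rows + dr).clamp(0, H - 1)
        c = (cols + dc).clamp(0, W - 1)
        som_loss = som_loss + ((codebook[r * W + c] - z_e.detach()) ** 2).mean()
    return z_st, som_loss
```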
The Incomplete Rosetta Stone Problem: Identifiability Results for Multi-View Nonlinear ICA
We consider the problem of recovering a common latent source with independent
components from multiple views. This applies to settings in which a variable is
measured with multiple experimental modalities, and where the goal is to
synthesize the disparate measurements into a single unified representation. We
consider the case that the observed views are a nonlinear mixing of
component-wise corruptions of the sources. When the views are considered
separately, this reduces to nonlinear Independent Component Analysis (ICA) for
which it is provably impossible to undo the mixing. We present novel
identifiability proofs that this is possible when the multiple views are
considered jointly, showing that the mixing can theoretically be undone using
function approximators such as deep neural networks. In contrast to known
identifiability results for nonlinear ICA, we prove that independent latent
sources with arbitrary mixing can be recovered as long as multiple,
sufficiently different noisy views are available.
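The assumed generative process is easy to simulate; in the toy sketch below (the mixing functions are arbitrary stand-ins, not the paper's setup), independent sources receive component-wise noise and are then nonlinearly mixed into each view.

```python
# Toy simulation of the multi-view nonlinear ICA data model.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
z = rng.laplace(size=(n, d))                # independent non-Gaussian sources

def make_view(z, seed):
    rng_v = np.random.default_rng(seed)
    corrupted = z + 0.3 * rng_v.standard_normal(z.shape)  # component-wise corruption
    M = rng_v.standard_normal((d, d))       # random linear map ...
    return np.tanh(corrupted @ M)           # ... composed with a nonlinearity

x1, x2 = make_view(z, 1), make_view(z, 2)   # two sufficiently different views
```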
Shortcuts for causal discovery of nonlinear models by score matching
The use of simulated data in the field of causal discovery is ubiquitous due
to the scarcity of annotated real data. Recently, Reisach et al., 2021
highlighted the emergence of patterns in simulated linear data, which display
increasing marginal variance in the causal direction. As an ablation in their
experiments, Montagna et al., 2023 found that similar patterns may emerge in
nonlinear models for the variance of the score vector $\nabla \log p_{\mathbf{X}}$, and introduced the ScoreSort algorithm. In this work, we
formally define and characterize this score-sortability pattern of nonlinear
additive noise models. We find that it defines a class of identifiable
(bivariate) causal models overlapping with nonlinear additive noise models. We
theoretically demonstrate the advantages of ScoreSort in terms of statistical
efficiency compared to prior state-of-the-art score matching-based methods and
empirically show the score-sortability of the most common synthetic benchmarks
in the literature. Our findings highlight (1) the lack of diversity in the data
as an important limitation in the evaluation of nonlinear causal discovery
approaches, (2) the importance of thoroughly testing different settings within
a problem class, and (3) the importance of analyzing statistical properties in
causal discovery, where research is often limited to defining identifiability
conditions of the model.
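A minimal sketch of the score-sortability pattern on a toy bivariate model, assuming access to the score $\nabla \log p_{\mathbf{X}}$ (computed in closed form here; ScoreSort itself estimates it via score matching, and the sort direction below is an illustrative assumption, not the paper's exact recipe):

```python
# Toy illustration: order variables by the variance of their score entry.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.standard_normal(n)
x2 = np.sin(2 * x1) + 0.5 * rng.standard_normal(n)   # nonlinear ANM: x1 -> x2

# Closed-form score of the toy model p(x1, x2) = p(x1) * p(x2 | x1),
# with x2 | x1 ~ N(sin(2 x1), 0.25):
s2 = -(x2 - np.sin(2 * x1)) / 0.25                   # d log p / d x2
s1 = -x1 - 2 * np.cos(2 * x1) * s2                   # d log p / d x1 (chain rule)

score_var = np.array([s1.var(), s2.var()])
order = np.argsort(score_var)     # candidate causal order from score variances
```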
Sample Complexity Bounds for Score-Matching: Causal Discovery and Generative Modeling
This paper provides statistical sample complexity bounds for score-matching
and its applications in causal discovery. We demonstrate that accurate
estimation of the score function is achievable by training a standard deep ReLU
neural network using stochastic gradient descent. We establish bounds on the
error rate of recovering causal relationships using the score-matching-based
causal discovery method of Rolland et al. [2022], assuming a sufficiently good
estimation of the score function. Finally, we analyze the upper bound of
score-matching estimation within the score-based generative modeling, which has
been applied for causal discovery but is also of independent interest within
the domain of generative models.
Comment: Accepted in NeurIPS 2023
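For concreteness, here is a minimal sketch of the kind of estimator the analysis concerns: a deep ReLU network trained with SGD, here on a denoising score-matching objective (the architecture, noise level, and data are illustrative placeholders):

```python
# Sketch: ReLU network trained by SGD on a denoising score-matching loss.
import torch
from torch import nn

d, sigma = 2, 0.1
net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, d))               # score estimate s(x)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

data = torch.randn(4096, d)                         # placeholder samples
for step in range(1000):
    x = data[torch.randint(0, len(data), (128,))]
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps
    # Denoising score matching: s(x_noisy) should match -eps / sigma,
    # the score of the Gaussian corruption kernel.
    loss = ((net(x_noisy) + eps / sigma) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```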
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
The key idea behind the unsupervised learning of disentangled representations
is that real-world data is generated by a few explanatory factors of variation
which can be recovered by unsupervised learning algorithms. In this paper, we
provide a sober look at recent progress in the field and challenge some common
assumptions. We first theoretically show that the unsupervised learning of
disentangled representations is fundamentally impossible without inductive
biases on both the models and the data. Then, we train more than 12000 models
covering most prominent methods and evaluation metrics in a reproducible
large-scale experimental study on seven different data sets. We observe that
while the different methods successfully enforce properties "encouraged" by
the corresponding losses, well-disentangled models seemingly cannot be
identified without supervision. Furthermore, increased disentanglement does not
seem to lead to a decreased sample complexity of learning for downstream tasks.
Our results suggest that future work on disentanglement learning should be
explicit about the role of inductive biases and (implicit) supervision,
investigate concrete benefits of enforcing disentanglement of the learned
representations, and consider a reproducible experimental setup covering
several data sets.
Stochastic Frank-Wolfe for Composite Convex Minimization
A broad class of convex optimization problems can be formulated as a
semidefinite program (SDP): the minimization of a convex function over the
positive-semidefinite cone subject to some affine constraints. The majority of
classical SDP solvers are designed for the deterministic setting where problem
data is readily available. In this setting, generalized conditional gradient
methods (aka Frank-Wolfe-type methods) provide scalable solutions by leveraging
the so-called linear minimization oracle instead of the projection onto the
semidefinite cone. Most problems in machine learning and modern engineering
applications, however, contain some degree of stochasticity. In this work, we
propose the first conditional-gradient-type method for solving stochastic
optimization problems under affine constraints. Our method guarantees
$\mathcal{O}(k^{-1/3})$ convergence rate in expectation on the objective
residual and $\mathcal{O}(k^{-5/12})$ on the feasibility gap.
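A simplified sketch of the projection-free mechanism: a stochastic conditional-gradient step over the spectrahedron $\{X \succeq 0, \operatorname{tr} X \le \beta\}$ with an averaged gradient estimate. The paper's full method additionally handles affine constraints, which this sketch omits, and the objective and averaging weights here are illustrative.

```python
# Sketch of stochastic Frank-Wolfe over the spectrahedron: the linear
# minimization oracle is a single extreme eigenvector, so no projection
# onto the PSD cone is ever needed.
import numpy as np

def lmo_psd(G, beta):
    """argmin of <G, X> over {X PSD, tr X <= beta}: either the apex 0,
    or beta * v v^T for v the bottom eigenvector of G."""
    w, V = np.linalg.eigh(G)
    if w[0] >= 0:
        return np.zeros_like(G)
    v = V[:, 0]
    return beta * np.outer(v, v)

n, beta = 20, 1.0
rng = np.random.default_rng(0)
X = np.zeros((n, n))
d = np.zeros((n, n))                        # running gradient estimate
for k in range(1, 200):
    rho = k ** (-2.0 / 3.0)                 # averaging weight (illustrative)
    g = X - np.eye(n) + 0.1 * rng.standard_normal((n, n))  # stochastic gradient
    g = (g + g.T) / 2                       # keep the estimate symmetric
    d = (1 - rho) * d + rho * g             # averaged gradient estimate
    gamma = 2.0 / (k + 2)
    X = (1 - gamma) * X + gamma * lmo_psd(d, beta)   # projection-free update
```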
Rotating Features for Object Discovery
The binding problem in human cognition, concerning how the brain represents
and connects objects within a fixed network of neural connections, remains a
subject of intense debate. Most machine learning efforts addressing this issue
in an unsupervised setting have focused on slot-based methods, which may be
limiting due to their discrete nature and difficulty in expressing uncertainty.
Recently, the Complex AutoEncoder was proposed as an alternative that learns
continuous and distributed object-centric representations. However, it is only
applicable to simple toy data. In this paper, we present Rotating Features, a
generalization of complex-valued features to higher dimensions, and a new
evaluation procedure for extracting objects from distributed representations.
Additionally, we show the applicability of our approach to pre-trained
features. Together, these advancements enable us to scale distributed
object-centric representations from simple toy to real-world data. We believe
this work advances a new paradigm for addressing the binding problem in machine
learning and has the potential to inspire further innovation in the field.
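A loose, hypothetical sketch of how objects might be read out from orientation-based distributed features (the thresholding and clustering choices are our assumptions, not the paper's evaluation procedure): each position's feature vector splits into a magnitude and a direction, and the directions of sufficiently active positions are clustered into objects.

```python
# Hypothetical readout of objects from orientation-based features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 64, 8))    # (H, W, n) placeholder features
mag = np.linalg.norm(feats, axis=-1)        # per-position magnitude
direction = feats / (mag[..., None] + 1e-8) # unit orientation vectors

active = mag > np.median(mag)               # drop weakly activated positions
labels = KMeans(n_clusters=3, n_init=10).fit_predict(direction[active])

obj_map = np.full(mag.shape, -1)            # -1 marks background
obj_map[active] = labels                    # per-position object assignment
```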